Fast Algorithms for Reconciliation under Hybridization and 

Incomplete Lineage Sorting* 



Yun Yu and Luay Nakhleh 

Department of Computer Science, Rice University, 
6100 Main Street, Houston, TX 77005, USA 
E-mail: {yy9, nakhleh} @cs.rice.edu. 



Abstract 

Reconciling a gene tree with a species tree is an important task that reveals much about the evolu- 
tion of genes, genomes, and species, as well as about the molecular function of genes. A wide array of 
computational tools have been devised for this task under certain evolutionary events such as hybridiza- 
tion, gene duplication/loss, or incomplete lineage sorting. Work on reconciling gene tree with species 
phylogenies under two or more of these events have also begun to emerge. Our group recently devised 
both parsimony and probabilistic frameworks for reconciling a gene tree with a phylogenetic network, 
thus allowing for the detection of hybridization in the presence of incomplete lineage sorting. While 
the frameworks were general and could handle any topology, they are computationally intensive, ren- 
dering their application to large datasets infeasible. In this paper, we present two novel approaches to 
address the computational challenges of the two frameworks that are based on the concept of ancestral 
configurations. Our approaches still compute exact solutions while improving the computational time by 
up to five orders of magnitude. These substantial gains in speed scale the applicability of these unified 
reconciliation frameworks to much larger data sets. We discuss how the topological features of the gene 
tree and phylogenetic network may affect the performance of the new algorithms. We have implemented 
the algorithms in our PhyloNet software package, which is publicly available in open source. 
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1 Introduction 

Analysis of the increasingly available genomic data continue to reveal the extent of hybridization and its 
importance in the speciation and evolutionary innovations of several groups of species and animals |[Tll2l[T8l 
[HI [29]]. When hybridization occurs, the evolutionary history of the species and their genomes is reticulate 
and best modeled by a phylogenetic network which, in our context, is a special type of rooted, directed, 
acyclic graphs ETTl . Methods have been devised for inferring phylogenetic networks from pairs of gene 
trees (e.g., (3] [161 EU [23]]), larger collections of gene trees IT331 l34l l24l l27l . and directly from sequence 
data (e.g., |22j [TTJ [6] |25j [26]]). A salient feature of all these methods is that the incongruence of gene 
tree topologies, and more generally the heterogeneity among the different loci, is caused solely by reticulate 
evolutionary events such as horizontal gene transfer or hybridization. 

While hybridization causes incongruence among gene trees, other evolutionary events can also result 
in incongruence, such as incomplete lineage sorting (ILS) and gene duplication/loss ifTTl . In particular, as 
(successful) hybridization occurs between closely related species, it is important to account simultaneously 
for incomplete lineage sorting, a phenomenon that arises in similar situations lfT4l[T8ll . While a wide array of 
methods have been devised for inference under ILS along (see (4] [15]] for recent surveys), it is important to 
integrate both hybridization and ILS into a single framework for inference. Needless to say, it is important 
to integrate all sources of incongruence into a single framework, but that is much beyond the scope of this 
paper. The main task, then, becomes: given a gene tree topology and a phylogenetic network, to reconcile the 
gene tree within the branches of the phylogenetic network, thus allowing simultaneously for hybridization 
and ILS. When a method for achieving this task is "wrapped" by a strategy for searching the phylogenetic 
network space, the result is a method for inferring reticulate evolutionary histories in the presence of both 
hybridization and ILS. Therefore, is is very important to solve the reconciliation problem. 

Indeed, in the last five years, several attempts have been made, following different approaches, to address 
the problem of inferring hybridization in the presence of ILS IT3T1 [8] [20] [13] Q2] [38l . However, due to 
the computational challenges of the problem, these methods focused on very limited cases: fewer than 5 
taxa, one or two hybridization events, and a single allele sampled per species. More recently, our group 
proposed two methods for detecting hybridization in the presence of incomplete lineage sorting, including a 
probabilistic method which computes the probability of gene tree topologies given a phylogenetic network 
071 and a parsimony method which computes the minimum number of extra lineages IfTTl required to 
reconcile a gene tree within the branches of a phylogenetic network ||36l . While these methods are general 
in terms of the topologies and sizes of gene trees and phylogenetic networks, they are computationally 
intensive. In particular, these methods convert a phylogenetic network to a special type of trees, called multil- 
labeled trees (MUL-trees), and conduct computation on these trees while accounting for every possible 
mapping of genes to their leaves. This computation can be exponential in the number of leaves, and does 
explicit computations of coalescent histories of the gene genealogies. 

In this paper, we propose a novel way of computing the probability of gene tree topologies given a 
phylogenetic network, and a novel way of computing the minimum number of extra lineages of a gene tree 
and a phylogenetic network. Both of them use the concept of ancestral configuration (or AC) which was 
introduced very recently for computing the probability of gene tree topologies given a species tree [35 ]. The 
new algorithms are exact and much more efficient than the two MUL-tree based algorithms we introduced in 
|[37l[36l . In our extensive simulation studies, we compared the running time of the new AC -based methods 
with the previous MUL-tree based ones. We show that the new algorithms can speed up the computation by 
up to 5 orders of magnitude, thus allowing for the analysis of much larger data sets. Furthermore, we discuss 
how the running time of the new methods is still affected by the topologies of the species networks, more 
specifically the configurations of reticulation nodes, and the topologies of gene trees. All methods described 
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in this paper have been implemented in the PhyloNet software package ll32l which is freely available for 
download in open source at http://bioinfo.cs.rice.edu/phylonet 

2 Background 

In this work, we assume the following definition of phylogenetic networks [21 ]. 

Definition 1 A phylogenetic 56 -network, or 56 -network for short, N is an ordered pair (G, £), where 
G = (V, E) is a directed, acyclic graph (DAG) with V = {r} U Vl U Vt U Vn, where (I) indeg(r) = 
(r is the root of N); (2) \/v G Vt,, indeg(v) = 1 and outdeg(v) = (Vl are the external tree nodes, or 
leaves, of N); (3) \/v G Vt, indeg(v) = 1 and outdeg(v) > 2 (Vt are the internal tree nodes of N); and, 
(4) Vu G V/v, indegiv) = 2 and outdeg(v) = 1 (Vn are the reticulation nodes of N); E C V x V are the 
network's edges , and I : Vl —> 56 is the leaf-labeling function, which is a bijection from Vl to 36 . 

For the probabilistic setting of the problem, we also associate with every pair of reticulation edges inheri- 
tance probabilities 7( Ulj „) and 7( U2) „) sucn that 7( Uli „) + l(ui,v) = 1- Inheritance probability j( UjV ) indicates 
the proportion of alleles in population v that are inherited from population u. A gene tree is a phylogenetic 
network with no reticulation nodes. 

The way in which a gene evolves within the the branches of a phylogenetic network can be described 
by a coalescent history (37). Let N be a phylogenetic network. We denote by V(N) the set of nodes in 
N and by N u the set of nodes that are reachable from the root of N via at least one path that goes through 
node u G V(N). Given a phylogenetic network N and a gene tree g, a coalescent history is a function 
h : V(g) — > V(N) such that the following two conditions hold: (1) if v is a leaf in g, then h(v) is the leaf 
in ./V with the same label (in the case of multiple alleles, h(v) is the leaf in N with the label of the species 
from which the allele labeling leaf v in g is sampled); and, (2) if v is a node in g u , then h(v) is a node in 
N h ^ u y See Fig.|3]in the Appendix for an illustration. 

Given a phylogenetic network N and a gene tree g, we denote by H^{g) the set of all coalescent 
histories. Then the probability of observing gene tree g given phylogenetic network N is 

P(g\N)= Y, P ( h \ N )> d) 

heH N (g) 

where P(h\N) is the probability of coalescent history h given phylogenetic network N (along with its 
branch lengths and inheritance probabilities). Coalescent histories can also be used to compute the minimum 
number of extra lineages required to reconcile gene tree g with N, which we denote by XL(N, g), as 

XL(N,g)= min XL(N,h), (2) 

heH N (g) 

Methods for computing P(g\N) and XL(N, g) when N is a tree were recently given in [5 ] and |f30l . 
respectively. Recently, we proposed new methods for computing these two quantities when N is a phylo- 
genetic network ||37l |36l . The basic idea of both of these methods is to convert the phylogenetic network 
N into a MUL-tree T and then make use of some existing techniques to complete the computation on T 
instead of on N. A MUL-tree is a tree whose leaves are not uniquely labeled by a set of taxa. Therefore, 
alleles sampled from one species, say x, can map to any of the leaves in the MUL-tree T that are labeled 
by x. For network on taxa 56 , we denote by A x the set of alleles sampled from species x (x G 56), 
and by c x the set of leaves in T that are labeled by species x. Then a valid allele mapping is a function 
/ : (U xe &A x ) — > iy> x ^.%' c x) such that if f(a) = d, and d G c x , then a G A x 071 . Fig. [4] in the Ap- 
pendix shows an example of converting a phylogenetic network into a MUL-tree along with all valid allele 
mappings when single allele is sampled per species. 
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Suppose T is the MUL-tree converted from network N. We denote by ^T,g the set of all valid allele 
mappings for MUL-tree T and gene tree g. Then the probability of observing gene tree g given TV can be 
computed using MUL-tree T as follows 

P(9\N)= Yl E P ( h \ T >f)> ( 3 ) 

/e^T, 9 heH T>f ( g ) 

where Hxj(g) is the set of coalescent histories of g within MUL-tree T under valid allele mapping /, and 
P(h\T, f) is the probability of observing coalescent history h within T under / 1371 . Furthermore, the 
minimum number of extra lineages required to reconcile gene tree g with N can also be computed using 
MUL-tree T by 

XL(N,g)= min min XL(T,f,h), (4) 

fe&T, B heH T j(g) 

where XL(T, /, h) is the total number of extra lineages of coalescent history h within T under allele map- 
ping / 136). 

The advantage of the MUL-tree based techniques is that once the network is converted to the MUL-tree, 
tree-based techniques from the multi-species coalescent theory apply with minimal revision. Nonetheless, 
from Eq. Q and Eq. Q we can see that the running time of both two methods depend on the number of 
valid allele mappings. Let Vl(N) be the set of leaves of N, and a(x) be the number of alleles sampled 
from some x in Vl(N) in gene tree g. Then the number of valid allele mappings of N and g is bounded 
from below and above by 2^ v lW r ™™M a W and 2 E »evb(iN0 rmojWa(j:) , respectively, where r min {x) and 
i"max{x) are the minimum and maximum number of reticulation nodes on any path from leaf x in Vl(N) to 
the root of N respectively. We can see that when the number of taxa or sampled alleles increases, or when 
the number of reticulation nodes increases, this number can quickly become very large which makes the 
computations prohibitive. Furthermore, computing term P(h\T, /) in Eq. Q using coalescent histories will 
become infeasible when the number of taxa or sampled alleles increases ll35ll . 

3 Methods 

Central to our methods is the concept of ancestral configuration (or simply configuration, or AC). When it 
was first introduced, it was defined on species trees for computing the probability of gene tree topologies 
ll35l . In this work, we extend it to species networks. Given a species network N and a gene tree g, an 
ancestral configuration at node v of N, which we denote by AC V (the subscript v may be omitted when the 
identity of node v is clear from the context), is a set of gene lineages at node v under some coalescent history 
h in H^(g)- The number of gene lineages in configuration AC V is denoted by n(AC v ). For example, given 
the coalescent history /13 shown in Fig. [5J for reticulation node v, we have AC = {61, 62} and n(AC) = 2; 
for the root of N, we have AC = {a, c, y} and n(AC) = 3. Furthermore, we denote by s^ c € v a set of pairs 
(a, w) where a is a configuration at node v of N and w is the weight of a, and by a set of (a, w) 

where a is a configuration that about to leave branch (u, v) of N and w is the weight of a. We will discuss 
how to set/use the weight w below. 

Assume m and n are two gene lineages that meet at some node in a gene tree g. When reconciling 
g within the branches of a species network N, after they two entered the same branch of N, they might 
or might not have coalesced before leaving that branch, the probability of which depends on the length 
(in terms of time) and width (in terms of population size) of that branch. Therefore, one configuration 
entering a branch of N might give rise to several different configurations leaving that branch with different 
probabilities. For example, suppose a gene tree g has a subtree ((a, b)x, c)y (tree with root y, leaf-child 
c of the root, child x of the root, and two leaves a and b that are children of x). Then if a configuration 
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{a, b, c} entered a branch of N, it could give rise to one of three different configurations leaving that branch, 
including {a, b, c] {x, c] and {y}. We denote by Coal(AC, g), for configuration AC and gene tree g, the 
set of all configurations that AC might coalesce into with respect to the topology of g. We now show how 
to use configurations to compute P(g\N) and XL(N, g) efficiently. 
3.1 Counting the number of extra lineages 

For a configuration AC, we denote by xl(AC) the minimum total number of extra lineages on all branches 
that the extant gene lineages in AC having passed through from time to coalesce into the present gene 
lineages in AC. In this method, weight w in (AC, w) £ srf^ corresponds to xl(AC), where srf'io is either 
si c € v where v is a node or st/ffb where b is a branch. 

Observation 1 Let AC be a configuration entering a branch b and AC + be a configuration that AC coa- 
lesced into when leaving b. Then 

xl(AC + ) = xl(AC) + n(AC + ) - 1, (5) 
where n(AC + ) — 1 is the number of extra lineages on branch b. 

We define a function called CreateCACsForXL which takes a gene tree g, a branch b = (u, v) of 
the network TV and a set of configuration- weight pairs s^ c € v that enter branch b, and returns a set of 
configuration- weight pairs sf&fav) triat leave branch b. 



Algorithm 1: CreateCACsForXL. 

Input: Gene tree g, a branch b — (u, v), a set of configuration-weight pairs sftf v 
Output: A set of configuration-weight pairs ■srf t &( u , v ) 
foreach (AC,xl{AG)) € s? c € v do 

AC+ <- aigmm AC , eCoal{AC g) n{AC); 

Compute xl(AC + ) using Eq. J3J; 
_ ^ { u,v) <- **V lu , v) U (AC\xl(AC+)) ; 
return sd^ \ u ,v) 



Note that although one configuration can coalesce into several different configurations along a branch, 
under parsimony we only need to keep the one that has the minimum total number of extra lineages. There- 
fore |^^f v | = \^^{u,v)\ an d there is 1-1 correspondence between configurations in \sSf^ v \ and configura- 
tions in \£f&( u ,v)\- 

For a phylogenetic network N and a gene tree g, the algorithm for computing the minimum number of 
extra lineages required to reconcile g within N is shown in Alg. [2] Basically, we traverse the nodes of the 
network in post-order. For every node v we visit, we construct the set of configuration- weight pairs s^ c € v 
for node v based on its type. Recall that there are four types of nodes in a phylogenetic network, which are 
leaves, reticulation nodes, internal tree nodes, and the root. Finally when we arrive at the root of N, we are 
able to obtain XL(N, g). 

At a reticulate node v who has parents u\ and ui, every gene lineage could independently choose to go 
toward u\ or u%. So for every (AC, xl(AC)) in s^^€ v , there are 2 n ^ AC ' ) different ways of splitting AC into 
two configurations, say ACi and AC2, such that AC V = AC\ U AC2- For example, a configuration {a, b} 
can be split in four different ways including {a, b} and {0}, {a} and {b}, {b} and {a}, and {0} and {a, b}. 
It is important to keep track of those gene lineages that are originally coming from one splitting so that we 
could merge them back once they are in the same population again. Note that there is no need to consider 
coalescent events on branch (u\, v) or (112, v), because all gene lineages on these two branches were already 
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Algorithm 2: CountXL. 



Input: Phylogenetic network N, gene tree g 
Output: XL(N, g) 

while traversing the nodes of N in post-order do 
if node v is a leaf, who has parent u then 

srf c /ci' v <— {{AC, 0)} where AC is the set of leaves in g sampled from the species v which is labeled by; 
^(u,v) <- CreateCACsForXL(#, (u, v),af^ v ); 

else if node v is a reticulation node, who has child w, and two parents ui and 112 then 

foreach (AC,xl(AC)) € sPif v do 
foreach ACi C AC do 

AC 2 <-AC-A&; 

^( U1 ,v) <- ^(uuv) U (ACi,xl(AC) + n(Ad) - 1); 
£ftf (U2 , v) <- ^ c <f ( u 2 ,v) U (AC^n^Ca) - 1); 

else if node v is an internal tree node or root, who has two children W\ and W2 then 
foreach (ACi,xl(ACi)) e ntf^^ do 
foreach (AC 2, xl(ACi)) £ sPg^^ do 
if ACi and AC'2 are compatible then 

L si c € v <- sffv U (Ad U AC 2 ,xl(ACi) + xl(AC 2 )) 

if node v is an internal tree node, who has a parent u then 
\_ *P&( UlV ) <- CreateCACsForXL(g, (u, v), £/<& v ); 

else 

[_ return mm( AC , w i(AC))eat<<? v xl(AC)\ 



in the same population on branch (v,w). And under parsimony where all gene lineages are assumed to 
coalesce as soon as they can, all possible coalescent events that could happen among these gene lineages 
must have already been applied on branch (it, w). As a result, (AC\, xl(AC\)) and (AC2, xl(AC2)) can be 
put directly into si c €t U x,v) anc ^ ■ s ^ c ^'(u2,v) respectively. 

The compatibility of configurations in the algorithm is defined as follows. Two configurations are com- 
patible if for every reticulation node, either both configurations went through that node and had resulted 
from the same split of an ancestral configuration, or at least one of the two configurations did not go through 
that node. Fig. [5]in the Appendix illustrates configurations generated for every node and branch of a network 
given a gene tree. 

3.2 Calculating gene tree probability 

For a configuration AC, we denote by p(AC) the cumulative probability of the extant gene lineages in AC 
coalescing into the present gene lineages in AC from time 0. In this method, weight w in (AC, w) € srf'tf 
corresponds to p(AC), where is either ^'tfy where v is a node or g/ffb where b is a branch. 

Observation 2 Let AC be a configuration entering branch b of network N with branch length Then the 
probability of observing configuration AC + leaving branch b is 



w b (AC, AC^ 
d b (AC, AC+) ' 



Pt{AC,AC + ,b)=p n{AC)MAC+) (\ b ) W? J : , (6) 



where P n (AC),n(AO+) (^b) is me probability that n(AC) gene lineages coalesce into n(AC + ) gene lineages 
within time X b , w b (AC, AC + ) is the number of ways that coalescent events can occur along branch b to 
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coalesce AC into AC + with respect to the gene tree topology, and dj,{AC, AC + ) is the number of all 
possible orderings ofn(AC) — n{AC + ) coalescent events. 

The details of how to compute p n (AC),n(AC+)(^b)> wj,{AC, AC + ) and di{AC, AC + ) are given in @. 

Observation 3 Let AC be a configuration entering a branch b and AC + be a configuration that AC coa- 
lesced into when leaving b. Then 

p(AC + ) = p(AC) Pt (AC, AC+, b). (7) 

We define a function called CreateCACsForProb which takes a gene tree g, a branch b = {u, v) of the 
network N and a set of configuration- weight pairs si c € v that enter branch b, and returns a set of all possible 
configuration- weight pairs sf&fav) that leave branch b. 



Algorithm 3: CreateCACsForProb. 

Input: Gene tree g, a branch b — (u, v), a set of configuration-weight pairs g/ff, 
Output: A set of configuration-weight pairs stf^^^ 
foreach (AC,p(AC)) 6 s/V v do 
S<- Co&\{AC, g)\ 
foreach AC + eSdo 

Compute p(AC + ) using Eq. {7}; 
if ( AC + , w) g si^ \ u ,v) f or some weight w then 
[ mf- w + p(AC + ) 

else 

[ «- *f*(»,v) U (AC + ,p(AC+)) ; 

return s^^^ U)V y, 



Note that several configurations can coalesce into the same configuration along a branch, but we only 
need to keep one copy of every distinct configuration. Here, we define two configurations to be the identical 
if they satisfy the following two conditions: (1) they contain the same set of gene lineages, and (2) for every 
reticulation node v' in the network, either neither of them contain lineages that have passed through it, or 
the lineages in these two configurations that passed through it originally came from one splitting at node 
v'. The algorithm for calculating the probability of observing a gene tree g given a species network N is 
shown in Alg. [4j The basic idea is similar to the parsimony method we described in the previous section. 
An illustration is given in Fig. [5] 
3.3 Reducing the number of configurations 

At every reticulation node v in the species network, every configuration AC in {AC, w) £ s^^y is split 
into two configurations in all 2 n ( AC > possible ways. This may result in multiple {AC, w) pairs in a s/ c 6' set 
where their configurations have the same set of gene lineages but not considered to be the same because they 
were not originally from one splitting at some reticulation node some lineages in them have passed through. 
It may increase the number of configurations significantly. It is clear that the running time of both these two 
algorithms depends on the number of configurations. So in order to reduce the number of configurations 
so as to speedup the computation, we make use of articulation nodes in the graph (an articulation node 
is a node whose removal disconnects the phylogenetic network). Obviously, the reticulation nodes inside 
the sub-network rooted at an articulation node are independent of the reticulation nodes outside the sub- 
network. So at articulation node v we can clear all the information about the splittings at all reticulation 
nodes under v so that all configurations at v containing the same set of gene lineages are considered to be 
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Algorithm 4: CalProb. 

Input: Phylogenetic network N including topology, branch lengths and inheritance probabilities, gene tree g 
Output: P(g\N) 

while traversing the nodes of N in post-order do 
if node v is a leaf, who has parent u then 

srf c /ci' v <— {{AC, 1)} where AC is the set of leaves in g sampled from the species which v is labeled by; 
^^(u,v) <- CreateCACsForProb(g, (u, v), sfif v y, 

else if node v is a reticulation node, who has child w, and two parents ui and 112 then 

S x <- 0, S 2 <- ; 
foreach (AC,p(AC)) 6 st<e v do 
foreach ACi C AC do 

AC 2 «- AC- Ad; 
S 1 4-{AC 1 ,p{AC) 1 ^); 

^^(uijv) CreateCACsForProb(p, (ui,v), Si); 
£p&( U3tV ) <~ CreateCACsForProb(p, (w 2 , v), S2); 

else if node v is an internal tree node or root, who has two children w\ and W2 then 
foreach (A&,p(ACi)) 6 £Pff( v , wl ) do 
foreach (AC2,p{AC 2 )) G J^^( v ,w 2 ) do 
if ACi and AC2 are compatible then 

e(<g v <- #ftf v U (ACi U AC 2 ,p(ACi)p(AC2)); 

if node D is an internal tree node, who has a parent u then 
I J^^(u,v) 4- CreateCACsForProb(g, («, v),s? c £ v ); 

else 

Let ACj; be the root lineage of the gene tree g ; 
|_ return T,(AC,p(AC))erfn? v Pt(AC, AC R , +oo)p(AC); 



the same. More precisely, when traversing the species network, after constructing s^ c € v for some internal 
tree node v as we have described in Alg. [2] and Alg. |4j if v is an articulation node, we clear all the 
information about splittings at all reticulation nodes in the sub-network rooted at v. Then for counting the 
minimum number of extra lineages, we update s^ c € v to be s^ c S" v such that only the configuration-weight 
pair that has the minimum weight is left, using the statement: szf^^ = {wgmm^ AC xl( ^ Ac ^^^ v xl{AC)}. 
And for computing the probability of the topology of a gene tree, we keep only one copy of every distinct 
configuration in the sense of the set of lineages it contains. More precisely, we update s^ c € v to be srf'io' v 
using = {(AC, w') : w' = T,(AC,w)e^ v W- 

4 Results and Discussion 

To study the performance of the two methods compared to the MUL-tree based ones, we ran all four on 
synthetic data generated as follows. We first generated 100 random 24-taxon species trees using PhyloGen 
ll28l . and from these we generated random species networks with 1, 2, 4, 6 and 8 reticulation nodes. When 
expanding a species network with n reticulation nodes to a species network with n + 1 reticulation nodes, 
we randomly selected two existing edges in the species network and connected their midpoints from the 
higher one to the lower one and then the lower one becomes a new reticulation node. Then, we simulated 
10, 20, 50, 100, 200, 500 and 1000 gene trees respectively within the branches of each species network using 
the ms program ifTOll . Since the MUL-tree methods are computationally very intensive, we employed the 
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following strategy: for the parsimony methods, we bounded the time at 24 hours (that is, killed jobs that did 
not complete within 24 hours). For the probabilistic ones, we bounded the time at 8 hours. All computations 
were run on a computer with a quad-core Intel Xeon, 2.83GHz CPU, and 4GB of RAM. 

For computing the minimum number of extra lineages, the results of the running time of both two 
methods are shown in Fig. [T] Overall, both two methods spent more time on data sets where the species 
networks contain more reticulation nodes. It is not surprising given the fact that adding more reticulation 
nodes increases the complexity of the networks in general. We can see that the speedup of the AC-based 
method over the MUL-tree based method also increased when the number of reticulation nodes in the species 
networks increased. It is up to over 5 orders of magnitude. In this figure, we only plot the results of 
the computations that could finish in 24 hours across all different number of loci sampled. In fact, the 
AC based method finished every computation in less than 3 minutes, even for the largest data set which 
contained species networks with 8 reticulations and 1000 gene trees. For the MUL-tree based one, out of 
100 repetitions the numbers of repetitions that were able to finish in 24 hours across all different loci are 
100, 100, 99, 96 and 88 for data sets containing species networks with 1, 2, 4, 6 and 8 reticulation nodes. 

For computing the probability of the gene tree topologies given a species network, we were not able 
to run the MUL-tree based one because we found it could not finish the computation in 24 hours given 
even for the smallest data set (one gene tree and a species network with one reticulation node). In contrast, 
the AC-based method only needed 0.4 seconds on the same data set which implies a speedup of at least 5 
orders of magnitude. Part of the results of the AC based algorithm are shown in Fig. [2] Again, only the 
results of the computations that could finish successfully in 24 hours across all loci were plotted. We can see 
that the number of data points in the figure decreased significantly when the number of reticulation nodes 
in the species networks increased. In fact, out of 100 repetitions, the numbers of repetitions that finished 
the computations successfully across all different loci are 99, 96, 84, 54 and 32 for data sets containing 
species networks with 1, 2, 4, 6 and 8 reticulation nodes respectively. The number of successful runs is 
much smaller than that for the parsimony method. Furthermore, those computations failed not only because 
of the 24 hours time limit. Part of them are due to memory issues: the number of configurations generated 
in the computation in order to cover all the possible coalescence patterns that could arise is much more than 
that needed in the parsimony method. And the increase in the number of reticulation nodes in the species 
network might result in a very large increase in the number of configurations. 

From Fig. [T]and Fig. [2] we observe that for both methods, the running time differed significantly from 
one data set to another. There are several factors that can affect the number of configurations generated 
during the computation which directly dominates the running time of the algorithm. Two of the factors that 
affect performance are the number of leaves under a reticulation node, as well as the topology of the gene 
tree. We considered a "controlled" data set, where we controlled the placement of the reticulation node 
as well as the shapes of the gene trees. In particular, we considered three networks, each with a single 
reticulation node, yet with 1, 8, and 15 leaves under the reticulation node, respectively (see Fig. [6] in the 
Appendix). Further, we considered two gene trees: gt\, whose topology is "contained" with each of the 
three networks, and gt2, whose disagreement with the three phylogenetic networks is very extensive that all 
coalescence events must occur above the root of the phylogenetic networks (Fig. [6] in the Appendix). We 
ran both AC-based methods on every pair of phylogenetic network and gene tree. 

For the parsimony method, if the gene tree is a contained tree of the species network, it can be reconciled 
into the species network with extra lineages. In this case, for every articulation node v of the network, 
s^ c € v has only one element (AC, w) and n(AC) = 1, and the running time is almost the same for all three 
networks and it is very fast (Table [T] in the Appendix). However, for gene tree 52 whose coalescent events 
have to happen all above the root, for every articulation node v, s^ c € v has only one element (AC, w) and 
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Figure 1: The running times (In of number seconds) of the MUL-tree based (t(MUL)), and AC-based (t(AC)) 
methods for computing parsimonious reconciliations, as well as the speedup logio(t(MUL)/t(AC)). 

n(AC) = q where q equals the number of leaf nodes under v. We know that at a reticulation node every 
configuration AC will give rise to 2 n ( Ac ^ configurations to each of its parents. Therefore, the running time 
of g2 increased when the number of nodes under the reticulation nodes in the species network increased, 
and m who is a parent of the reticulation node h has the largest s^ c € v set and l^/'tfyl = q where q is 
the number of leaves under h (Table [T] in the Appendix). Furthermore, we found that the number of valid 
allele mappings when using the MUL-tree based method is equal to the largest size of stf^v generated for a 
node v during the computation when all the coalescent events have to happen above the root of the species 
network if we do not reduce the number of configurations for articulation nodes. This is easy to see. For the 
AC-based algorithm, if we do not clear the splitting information at articulation nodes, then every element in 
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Figure 2: The running time (In of number of seconds) of the AC -based algorithm for computing the probability of 
gene tree topologies given a species network. The columns from left to right correspond to data sets containing species 
networks with 1, 4 and 8 reticulation nodes, respectively. 

s^^r, where R is the root of the network, represents a different combination of the ways every leaf lineage 
took at every reticulation node. And every valid allele mapping also represents the same thing. However, 
for most of the gene trees, not all coalescent events have to happen above the root, and that is part of where 
the AC -based algorithm improves upon the MUL-tree based one. Comparing g\ and gi we can see that 
for parsimony reconciliations, the more coalescent events that are allowed to occur under reticulation nodes 
with respect to the topology of the gene tree, the faster the method is. 

For the probabilistic method, since we need to keep all configurations so as to cover all possible coa- 
lescence patterns, the gene trees whose coalescent events have to happen above the root become the easiest 
case because they have only one reconciliation. It is exactly the opposite to the parsimony method where 
the gene trees whose coalescent events have to happen above the root take longest running time (Table [2] in 
the Appendix). For the MUL-tree based method, the probability is computed by summing up the probabil- 
ities of all coalescent histories in MUL-tree under all valid allele mappings. However, for most cases using 
ACs to compute the probability of a gene tree given a species tree is much faster than through enumerating 
coalescent histories due to the fact that the number of coalescent histories is much larger than the number 
of configurations generated ll35ll . That is part of the reason why the AC based algorithm outperforms the 
MUL-tree based one for computing the probability in terms of efficiency. 

A third factor that impacts performance is the dependency of the reticulation nodes in the phylogenetic 
network (roughly, how many of them fall on a single path to the root). For parsimonious reconciliations, 
when the reticulation events are independent (or, less dependent), the method is much faster. This is not 
surprising, given that almost all nodes are articulation nodes and the number of ACs is reduced significantly. 
For the probabilistic reconciliation, a similar trend holds, and the dependence of the reticulation nodes 
results in an explosion in the number of ACs. These results are given in more detail in Fig. [7] and Tables [3] 
and [4] in the Appendix. 

To sum up, for the data sets of the same size (e.g., number of taxa and reticulation nodes), the running 
time of the AC -based algorithms increases when there are more leaves under reticulation nodes and when 
the reticulation nodes are more dependent on each other. With respect to the topology of the gene tree and 
the species network, the more coalescent events that are allowed under reticulation nodes the faster the par- 
simony method is, and the opposite for the probabilistic method. For most cases, the AC -based methods 
are significantly much faster than the MUL-tree based ones. For parsimony, the gain in terms of efficiency 
comes from avoiding considering useless allele mappings including the ones that cannot yield the optimal 
reconciliation implied by the coalesced lineages in the configurations and the ones that correspond to the 
configurations being removed at articulation nodes. For probabilistic reconciliation, the gain comes from 
two parts. One is also avoiding considering useless allele mappings by removing corresponding configu- 
rations at articulation nodes. The other is using AC to compute the probability instead of enumerating the 
coalescent histories. 
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APPENDIX 




Figure 3: A phylogenetic network N, a gene tree g, and the eight possible coalescent histories of g within the branches 
of N. Here, one allele is sampled from taxa A and C, and two alleles from taxon B. 



Phylogenetic network MULtree Valid allele mappings 




Figure 4: Illustration of the conversion from a phylogenetic network to a MUL-tree, along with all valid allele 
mappings associated with the case in which single alleles a, b, c and d were sampled from each of the four species A, 
B, C and D, respectively. 



Table 1 : The results of running the AC-based algorithm for computing the minimum number of extra lineages given gene trees 
and species networks in Fig. [6] l-e/^hl is the number of configurations at the reticulation node h and max\srf c €\ is the maximum 
number of configurations generated at a node during computation. We labeled the first node v in post-order of traversal that contains 
the largest AC V set by m in Fig. [6] Furthermore, the last column is the number of valid allele mappings if using the MUL-tree 
base d method. 
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Figure 5: (Left) A phylogenetic network with configurations generated when counting the minimum number of extra 
lineages of the gene tree in the middle. (Right) A phylogenetic network with configurations generated when computing 
the probability of the topology of the gene tree in the middle. Configurations (along with arrows) in blue represent 
configurations generated for nodes and configurations (along with arrows) in red represent configurations generated for 
branches. The weight of every configuration is not included in the figure. The root has two configurations {a, b, c, d}, 
but they are not the same because one represents the scenario where a went left and b, c and d went right at the 
reticulation node, and the other represents the scenario where a and c went left and b and d went right at the reticulation 
node. 



Table 2: The results of running the AC-based algorithm for computing the probability of gene tree topologies given gene trees 
and species networks in Fig. [6] \^/^h\ is the number of configurations at the reticulation node h and max\srf c €\ is the maximum 
number of configurations generated at a node during computation. We labeled the first node v in post-order of traversal that contains 
the largest AC V set by m in Fig. [6] Furthermore, the last column is the number of valid allele mappings if using the MUL-tree 
based method. 
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Table 3: The results of running the AC-based algorithm for computing the minimum number of extra lineages given gene trees 
and species networks in Fig. [7] \sfifh\ is the number of configurations at the highest reticulation node h and max\srf c £\ is the 
maximum number of configurations generated at a node during computation. We labeled the first node v in post-order of traversal 
that contains the largest AC V set by m in Fig. [7] Furthermore, the last column is the number of valid allele mappings if using the 
MUL-tree based method. 
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Table 4: The results of running the AC-based algorithm for computing the probability of gene tree topologies given gene trees 
and species networks in Fig. [JJ \si e €}\ is the number of configurations at the highest reticulation node h and max\s^ c ^'\ is the 
maximum number of configurations generated at a node during computation. We labeled the first node v in post-order of traversal 
that contains the largest AC V set by m in Fig. [7J Furthermore, the last column is the number of valid allele mappings if using the 
MUL-tree based method. 
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Figure 6: Synthetic data with controlled placements of the reticulation nodes. (Top) A species tree ST. (Middle) Ni, 
N2 and N3 are three species networks constructed by adding one reticulation edge to ST at three different locations. 
(Bottom) Two gene trees gx, which is contained in all three networks, and g%, whose coalescent events have to happen 
above the root of all three networks. 
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Figure 7 : The effects of dependency of reticulation nodes in the species network and different gene tree topologies on 
the running time of the AC -based algorithms. (Left) A species tree ST. (Middle) Ni and N2 are two species networks 
constructed by adding seven reticulation edges to ST at different locations. (Right) two gene trees g\, which is a 
contained tree of both N\ and Ni, and cj2 whose coalescent events have to happen above the root of both two species 
networks. 



17 



